Lab: Multivariate Analysis

Author

Wolfgang Huber, Susan Holmes, TAs of the Stanford class ‘Modern Statistics for Modern Biology’, Julia Ruehle, Helena Crowell

Published

July 10, 2024

Goal

In this lab we will learn the basics of multivariate analysis and PCA using a few simple examples.

Work through this lab by running all the R code on your computer and making sure that you understand the input and the output. Make alterations where you see fit. We encourage you to work through this lab with a partner.

Setup

The following code installs the needed packages, skipping any that are already installed.

install_if_needed = function(x) {
  p = setdiff(x, rownames(installed.packages()))
  if (length(p) > 0) BiocManager::install(p)
}
install_if_needed(c("GGally", "factoextra", "ade4"))
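The helper above assumes that BiocManager itself is available; if it is not, install it from CRAN first:

```r
# BiocManager (a CRAN package) is required to install Bioconductor packages
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
```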

Obtain data sets that we will be working with.

turtles = read.table(url("https://web.stanford.edu/class/bios221/data/PaintedTurtles.txt"),
                     header = TRUE)
head(turtles)
  sex length width height
1   f     98    81     38
2   f    103    84     38
3   f    103    86     42
4   f    105    86     40
5   f    109    88     44
6   f    123    92     50
download.file(url = "https://web.stanford.edu/class/bios221/data/athletes.RData",
              destfile = "athletes.RData", mode = "wb")
load("athletes.RData")
athletes[1:3, ]
   m100 long weight highj  m400  m110  disc pole javel  m1500
1 11.25 7.43  15.48  2.27 48.90 15.13 49.28  4.7 61.32 268.95
2 10.87 7.45  14.97  1.97 47.71 14.46 44.36  5.1 61.76 273.02
3 11.18 7.44  14.20  1.97 48.29 14.81 43.66  5.2 64.16 263.20

Let’s first get to know our data sets.

Questions
  1. How many athletes / turtles do you have in the data sets?
  2. What’s the record distance in the long jump category? And which athlete (number) set this record?
  3. What’s the average time across all athletes for the 100m run?
  4. Can you plot the histogram showing the distribution of the times for the 100m run?
  5. How many of the athletes who run faster than average in the 100m run also run faster than average over the 1500m distance?
#1
nrow(athletes)
[1] 33
nrow(turtles)
[1] 48
#2
max(athletes$long)
[1] 7.72
which.max(athletes$long)
[1] 6
#3 
mean(athletes$m100)
[1] 11.19636
#4
hist(athletes$m100)

#5 
av100  = mean(athletes$m100)
av1500 = mean(athletes$m1500)

# running faster than average means a *smaller* time
sum( (athletes$m100 < av100) & (athletes$m1500 < av1500) )
[1] 13
#5 alternative solution: faster in both races corresponds to the FALSE/FALSE cell
with(athletes, table(m100 > mean(m100), m1500 > mean(m1500)))
       
        FALSE TRUE
  FALSE    13    5
  TRUE      7    8

Low dimensional data summaries and preparation

It is instructive to first consider 2-dimensional summaries of the data. The function ggpairs from the GGally package gives a nice summary of the features and how they are correlated with each other.

library("GGally")
ggpairs(turtles[, -1], axisLabels = "none")

Questions
  1. What do you see on the diagonal? What do the stars indicate next to the correlation value?
  2. Can you repeat this plot for the athletes data?
  3. In the lecture, we have seen another way to investigate correlations in the data. Use the pheatmap function in the package with the same name, pheatmap, to illustrate the pairwise correlations of the features in the athletes data set.
#1 
# Diagonal: histogram displaying the distribution of the different variables.
# Stars: indicate the significance level of the correlation between the two variables

#2 
ggpairs(athletes, axisLabels = "none")

#3
library("pheatmap")
mycolors = colorRampPalette(c("#FFD800", "#0056B9"))(100)
pheatmap(cor(athletes), cellwidth = 20, cellheight = 20,
         color = mycolors, breaks = seq(-1, +1, length.out = length(mycolors) + 1))

Preprocessing the data

In many cases, different variables are measured in different units and at different scales. As discussed in the lecture, we have various options to transform the data. Here, we elect to standardize the data to a common standard deviation. This rescaling is done using the scale function, which subtracts the mean and divides by the standard deviation, so that every column has a unit standard deviation and mean zero.

scaledTurtles = data.frame(scale(turtles[, -1]), sex = turtles[, 1])
head(scaledTurtles)
       length      width     height sex
1 -1.30299868 -1.1389780 -0.9929102   f
2 -1.05887715 -0.9023072 -0.9929102   f
3 -1.05887715 -0.7445267 -0.5163133   f
4 -0.96122853 -0.7445267 -0.7546117   f
5 -0.76593131 -0.5867462 -0.2780148   f
6 -0.08239102 -0.2711852  0.4368805   f
Questions
  1. Can you compute the standard deviation and mean of each column in the turtles data frame? Can you do the same on the scaled dataset, i.e. on scaledTurtles? What was the mean of the turtles’ heights before standardizing?
apply(turtles[, -1], 2, sd)
   length     width    height 
20.481602 12.675838  8.392837 
apply(scaledTurtles[, -4], 2, sd)
length  width height 
     1      1      1 
apply(turtles[, -1], 2, mean)
   length     width    height 
124.68750  95.43750  46.33333 
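The solution above shows the standard deviations; for completeness, the means of the scaled columns are zero up to floating-point error:

```r
# column means of the scaled data are (numerically) zero
round(apply(scaledTurtles[, -4], 2, mean), 10)
```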

We can visualize two columns/dimensions (for example height and width) of the scaled data using ggplot.

library("ggplot2")
ggplot(scaledTurtles, aes(x = width, y = height, group = sex)) +
  geom_point(aes(color = sex)) + coord_fixed()

What is the purpose of the coord_fixed() modifier here?
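One answer, in the comment style used for solutions elsewhere in this lab:

```r
# coord_fixed() fixes the aspect ratio so that one unit on the x-axis spans
# the same distance on screen as one unit on the y-axis. Since both variables
# are standardized to the same scale, this avoids visually distorting
# distances and the apparent shape of the point cloud.
```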

Dimensionality reduction

In this part, we will take points in a higher-dimensional space and project them down onto lower-dimensional subspaces.

The first example will be the projection of points in a two-dimensional space (defined by weight and disc distance in the athletes data set) onto a one-dimensional space, here the weight-axis (the x-axis).

But first we need to scale the athletes data set, in the same way as we did with the turtles data set.

scaledathletes = data.frame(scale(athletes))
n = nrow(scaledathletes)
# First, p is a 2-dimensional plot of the points defined by weight (x) and disc (y)
p = ggplot(scaledathletes, aes(x = weight, y = disc)) + geom_point(shape = 1)

# Then we add the projected points (blue) & the projection lines (dashed)
p + geom_point(aes(y = rep(0, n)), color = "#0056B9") +
    geom_segment(aes(xend = weight, yend = rep(0, n)), linetype = "dashed")

Questions

Now try to do the following:

  1. Calculate the standard deviation of the blue points (their \(x\)-coordinates) in the above figure.

  2. Make a similar plot showing projection lines onto the \(y\)-axis and show projected points in yellow. What is the variance of the projected points now?

#1
sd(scaledathletes$weight)
[1] 1
#2
p + geom_point(aes(x = rep(0, n)), color = "#FFD800") +
  geom_segment(aes(yend = disc, xend = rep(0, n)), linetype = "dashed")

sd(scaledathletes$disc)
[1] 1

Summarize 2D-data by a line

In the above example when projecting the 2-dimensional points to the weight axis, we lost the disc information. In order to keep more information, we will now project the 2 dimensional point cloud onto another line.

For this, we first compute a linear model to find the regression line using the lm function (linear model). We regress disc on weight. The regression line is defined by two parameters: its intercept and its slope. The intercept a is given by the first coefficient in the output of lm, and the slope b by the second:

reg1 = lm(disc ~ weight, data = scaledathletes)

Extract intercept and slope values

a = reg1$coefficients[1] # Intercept
b = reg1$coefficients[2] # slope

Plot the points p (computed in the code section before) and the regression line.

pline = p + geom_abline(intercept = a, slope = b, col = "blue", lwd = 1.5) + coord_fixed()

Add the projection lines (from the point to its fitted value)

pline + geom_segment(aes(xend = weight, yend = reg1$fitted),
                     color = "#FFD800", arrow = arrow(length = unit(0.15, "cm")))

Question

Can you regress weight on disc and generate a similar plot?

#1
reg2 = lm(weight ~ disc, data = scaledathletes)

# Extract the intercept and slope values   
a = reg2$coefficients[1] # Intercept
b = reg2$coefficients[2] # slope

# Plot the points p (computed in the code section before) & the regression line 

p = ggplot(scaledathletes, aes(x = disc, y = weight)) + geom_point(shape = 1) + coord_fixed()
newline = p + geom_abline(intercept = a, slope = b, col = "#0056B9")

# Add the projection lines (from the point to its fitted value)
newline + geom_segment(
  aes(y = weight, x = disc, yend = reg2$fitted, xend = disc), 
  color = "#FFD800", 
  arrow = arrow(length = unit(0.15, "cm"))) + coord_flip()

Question

Can you create a plot that shows all points, as well as both regression lines, i.e., a plot that show both the line you get from lm(disc ~ weight) and lm(weight ~ disc)?

We plot the data such that the \(x\)-axis is weight and the \(y\)-axis is disc. So we can directly use the intercept and slope parameters from the first regression, reg1. For the second regression, reg2, we invert \[\begin{align} y&=a+bx\quad\quad\Rightarrow\\ x&=-\frac{a}{b}+\frac{1}{b}y \end{align}\]

a1 = reg1$coefficients[1]
b1 = reg1$coefficients[2]
a2 = reg2$coefficients[1]
b2 = reg2$coefficients[2]
ggplot(scaledathletes, aes(x = weight, y = disc)) + geom_point(shape = 1) + 
  coord_fixed() + 
  geom_abline(intercept = a1, slope = b1, col = "#0056B9") +
  geom_abline(intercept = -a2/b2, slope = 1/b2, col = "#FFD800")

A line that minimizes distances in both directions

Below we plot a line chosen to minimize the error in both the horizontal and vertical directions. This amounts to minimizing the lengths of the orthogonal (diagonal) segments from the points to the line.

Specifically, we compute a line that minimizes the sum of squares of the orthogonal (perpendicular) projections of data points onto it; we call this the principal component line (purple).

X = cbind(scaledathletes$disc, scaledathletes$weight)
svda = svd(X)
pc = X %*% svda$v[, 1] %*% t(svda$v[, 1])
bp = svda$v[2, 1] / svda$v[1, 1]
ap = mean(pc[, 2]) - bp * mean(pc[, 1])

p + geom_segment(xend = pc[, 1], yend = pc[, 2], arrow = arrow(length = unit(0.15, "cm"))) + 
  geom_abline(intercept = ap, slope = bp, col = "purple", lwd = 1.5)
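As a quick sanity check (not part of the original lab), the variance of the data projected onto the first principal axis exceeds the variance along either coordinate axis (both equal to 1 after scaling):

```r
# variance of the scores along the first principal axis;
# for standardized 2D data this equals 1 + cor(disc, weight), which is > 1
var(X %*% svda$v[, 1])
```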

Now let’s see how we can apply what we have learned to a higher-dimensional data set.

Turtle PCA

To start, we will come back to the turtles data set. First, we need to make sure we understand the basic features of the data and preprocess it so that it is in the correct “shape” for running PCA.

Questions
  1. What are the mean values and standard deviations of each of the 3 features: length, width, and height?
  2. Scale the data.
  3. Explore the correlations between the 3 variables after scaling the data. What do you see?
#1 
apply(turtles[, -1], 2, mean)
   length     width    height 
124.68750  95.43750  46.33333 
apply(turtles[, -1], 2, sd)
   length     width    height 
20.481602 12.675838  8.392837 
#2
turtlesc = scale(turtles[, -1])

#3
corrs = cor(turtlesc)
corrs |> round(3)
       length width height
length  1.000 0.978  0.965
width   0.978 1.000  0.961
height  0.965 0.961  1.000
pheatmap(corrs, cellwidth = 40, cellheight = 40, color = mycolors)

From the correlations, you see that all 3 variables are strongly correlated. (In the heatmap, note that the color scale already starts with a high value at its lower end.) Hence we expect that the data can be well approximated by a single variable. Let’s do the PCA:

library("factoextra")
pca1 = princomp(turtlesc)
pca1
Call:
princomp(x = turtlesc)

Standard deviations:
   Comp.1    Comp.2    Comp.3 
1.6954576 0.2048201 0.1448180 

 3  variables and  48 observations.

To assess the relative importance of the principal components, we can look at their variances using a screeplot. The screeplot shows the eigenvalues for the standardized data.

fviz_eig(pca1, geom = "bar", width = 0.4)

Note: Here we see one very large component and two very small ones; the data are (almost) one-dimensional.

Questions
  1. What is the percentage of variance explained by the first PC? How can you obtain this value from the pca1 object?
  2. How many PCs would you use to project the turtles data set?
#1
summary(pca1)
Importance of components:
                          Comp.1     Comp.2      Comp.3
Standard deviation     1.6954576 0.20482013 0.144818032
Proportion of Variance 0.9785792 0.01428129 0.007139494
Cumulative Proportion  0.9785792 0.99286051 1.000000000
#2
# One PC would be sufficient.
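The proportion of variance explained can also be computed directly from the standard deviations stored in the pca1 object:

```r
# squared standard deviations are the component variances;
# normalize by their sum to obtain proportions of variance explained
pca1$sdev^2 / sum(pca1$sdev^2)
```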

Now, let’s plot the samples with their PC1 and PC2 coordinates, together with the variables. The joint representation of the samples and the variables is called a biplot.

fviz_pca_biplot(pca1, label = "var") 

Questions
  1. Can you extend this plotting code to color the female samples differently than the male samples?
  2. Did the males or female turtles tend to be larger?
#1
fviz_pca_biplot(pca1, label = "var", col.ind = turtles[,1]) 

#2
# Females 

Back to the athletes

Now let us run PCA on a larger data set and interpret the corresponding scree plot. Here we use a different package, ade4, whose output format differs slightly, but the principle is the same.

library("ade4")
# The dudi.pca function centers and scales the data by default
pca.ath = dudi.pca(athletes, scannf = FALSE)
pca.ath$eig
 [1] 3.4182381 2.6063931 0.9432964 0.8780212 0.5566267 0.4912275 0.4305952
 [8] 0.3067981 0.2669494 0.1018542
Questions
  1. Just like for the turtles data set above, can you produce a scree plot?
  2. How many PCs would you use to project the athletes data set?
  3. Can you plot the samples with their PC1 and PC2 coordinates, together with the variables, in a biplot?
  4. Can you plot the numbers of the athletes onto the samples? What do you notice about the numbers?
#1
fviz_eig(pca.ath, geom = "bar", bar_width = 0.3) + ggtitle("")

#2
#     Somewhere between 2 and 4 
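To make this answer concrete, we can look at the cumulative proportion of variance explained (for scaled data, the eigenvalues sum to the number of variables, here 10):

```r
# cumulative fraction of total variance explained by the first k PCs;
# the first two PCs account for about 60% of the variance
cumsum(pca.ath$eig) / sum(pca.ath$eig)
```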

#3
fviz_pca_biplot(pca.ath, label = "var") 

#4
fviz_pca_ind(pca.ath) + ggtitle("") + ylim(c(-2.5, 5.7))


Back to the single cell dataset

Dependencies

library(dplyr)
library(scran)
library(scater)
library(ggplot2)
library(ExperimentHub)

Data retrieval

This is what we did yesterday; open the R Markdown document that you saved yesterday and run these steps again.

eh <- ExperimentHub()
q <- query(eh, "Kang18")
(sce <- eh[[q$ah_id]])
see ?muscData and browseVignettes('muscData') for documentation
loading from cache
class: SingleCellExperiment 
dim: 35635 29065 
metadata(0):
assays(1): counts
rownames(35635): MIR1302-10 FAM138A ... MT-ND6 MT-CYB
rowData names(2): ENSEMBL SYMBOL
colnames(29065): AAACATACAATGCC-1 AAACATACATTTCC-1 ... TTTGCATGGTTTGG-1
  TTTGCATGTCTTAC-1
colData names(5): ind stim cluster cell multiplets
reducedDimNames(1): TSNE
mainExpName: NULL
altExpNames(0):

Quality control

sce <- sce[rowSums(counts(sce) > 0) > 0, ]
sce <- addPerCellQC(sce)
sce <- sce[, !isOutlier(sce$detected, nmads=2, log=TRUE)]
sce <- sce[rowSums(counts(sce) > 1) >= 10, ]

Analysis continued

Now we start with the new part

Feature selection

For each gene, we compute the variance and mean of the log-expression values. A trend is fitted to the variance against the mean for all genes. The fitted value for each gene is used as a proxy for the technical component of variation, under the assumption that most genes exhibit a low baseline level of variation that is not biologically interesting. The biological component of variation for each gene is defined as the residual from the trend. (see ?modelGeneVar)

# log-library size normalization
sce <- logNormCounts(sce) 
# model gene expr. mean vs. var.
df <- modelGeneVar(sce) 
head(df)
DataFrame with 6 rows and 6 columns
              mean     total      tech        bio      p.value          FDR
         <numeric> <numeric> <numeric>  <numeric>    <numeric>    <numeric>
NOC2L    0.0696493 0.0819799 0.0775952 0.00438472  3.59549e-01  7.40648e-01
HES4     0.0982435 0.1261553 0.1088819 0.01727339  1.56308e-01  7.40648e-01
ISG15    2.1064758 4.6389017 1.0478233 3.59107844 8.60522e-106 6.12519e-102
TNFRSF18 0.0682297 0.1015325 0.0760317 0.02550081  1.63899e-02  6.90315e-01
TNFRSF4  0.0583023 0.0831188 0.0650732 0.01804558  3.87765e-02  7.40648e-01
SDF4     0.0792608 0.0918320 0.0881562 0.00367583  3.95352e-01  7.40648e-01

Use dplyr to filter the above gene-level statistics (df) for the 2,000 genes with the highest biological variance component.

fd <- df |>
    data.frame() |>
    arrange(bio) |>
    tail(n=2e3)
# or...
fd <- df |>
    data.frame() |>
    slice_max(bio, n=2e3)

Reproduce the below scatter plot (geom_point()) of gene expression means vs. total variance estimates. You will need a 2nd layer to highlight the top-2,000 genes selected above (red points), and a 3rd layer to add the technical variance estimate (blue dashed line).

ggplot(df, aes(x=mean, y=total)) +
    geom_point() +
    geom_point(data=fd, col="red") +
    geom_line(aes(y=tech), col="blue", lty=2)

The genes selected above correspond to genes with a large deviation from the baseline level of (technical) variation, i.e., they tend to have a large biological variance component:

ggplot(df, aes(x=mean, y=bio)) +
    geom_point() +
    geom_point(data=fd, col="red") +
    geom_hline(yintercept=0, col="blue", lty=2)

Dimension reduction

In standard scRNA-seq data analysis pipelines, these “highly variable features” are subjected to principal component analysis (PCA) in order to identify major axes of variation. We can perform PCA on a SingleCellExperiment using runPCA() (scater package), where we specify the subset of features to use via argument subset_row:

# get selected features
length(sel <- rownames(fd))
[1] 2000
# perform principal component analysis
sce <- runPCA(sce, subset_row=sel)

By default, runPCA() computes 50 PCs, which are stored in the input object in the reducedDim() slot:

# rows=cells, columns=PCs
pcs <- reducedDim(sce, "PCA")
pcs[1:5, 1:5]
                       PC1      PC2       PC3        PC4        PC5
AAACATACAATGCC-1 -7.557107 2.385930 -1.413980  2.0673622  1.4634198
AAACATACATTTCC-1  8.666491 9.109722 -2.271693 -2.6015258 -1.8624359
AAACATACCAGAAA-1 11.916599 9.813017  2.072871  1.4183483  1.6350212
AAACATACCAGCTA-1  8.601655 9.003323  1.269496  2.5980889  0.4418282
AAACATACCATGCA-1  0.301985 2.381974  5.205035 -0.7592489 11.3635477
dim(pcs) 
[1] 26820    50
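runPCA() also records the percentage of variance explained by each PC as an attribute of the reduced-dimension matrix, which gives a quick scree plot for the single-cell data:

```r
# per-PC percentage of variance explained, stored by scater's runPCA()
pv <- attr(reducedDim(sce, "PCA"), "percentVar")
plot(pv, type = "b", xlab = "PC", ylab = "% variance explained")
```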

Construct a data.frame that includes both the PCs and the cell-level metadata (colData). Generate scatter plots of PCs 1 and 2 as shown below, with cells colored by experimental condition and by cell subpopulation, respectively; the x and y axes should have a fixed aspect ratio!

df <- data.frame(pcs, colData(sce))
ggplot(df, aes(PC1, PC2, col=stim)) +
    geom_point(size=0.2) + coord_equal() +
    guides(col=guide_legend(override.aes=list(size=2))) 

ggplot(df, aes(PC1, PC2, col=cell)) + 
    geom_point(size=0.2) + coord_equal() +
    guides(col=guide_legend(override.aes=list(size=2)))

It’s pretty clear from the plots above that experimental condition (stim) and cell subpopulations seem to be considerable drivers of gene expression variability. But what is driving these differences? We can explore this visually by coloring cells by other cell metadata, e.g., library size (total sum of counts).

Generate a scatter plot of PCs 1 and 2 with cells colored by log-library size. Taking into consideration the plots generated above, what is the main driver of gene expression variation (PC1)?

ggplot(df, aes(PC1, PC2, col=log10(sum))) +
    geom_point(size=0.2) + coord_equal() +
    scale_color_viridis_c()

Finally, note that the dataset retrieved from ExperimentHub already contains a non-linear embedding, t-SNE, that was pre-computed by the original authors. We can access this representation as follows, and you are welcome to explore the data visually on your own, e.g.,

  • color cells by gene expression (logcounts())
  • plot PCs (linear) against t-SNE dimensions (non-linear)
  • facet_wrap() by stim/ind/cell to split plots according to some metadata variable
map <- reducedDim(sce, "TSNE")
df <- cbind(df, map)
ggplot(df, ...)
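One possible starting point (a sketch; the column names TSNE1/TSNE2 are our own, assigned below rather than taken from the object):

```r
# cells in t-SNE space, colored by subpopulation and split by condition
tsne <- data.frame(TSNE1 = map[, 1], TSNE2 = map[, 2], colData(sce))
ggplot(tsne, aes(TSNE1, TSNE2, col = cell)) +
    geom_point(size = 0.2) + coord_equal() +
    facet_wrap(~ stim)
```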